Local Grammars and Compound Verb Lemmatization in Serbo-Croatian

Author

  • Cvetana Krstev
Abstract

The growing production of electronic texts, whether on the Web or in other digital forms such as digital libraries and archives, demands computer tools that help human users manipulate text and that support automatic processing of language resources. First of all, a natural language processing (NLP) system needs to implement models for recognizing and isolating the various lexical constituents that occur in a digital text, so that syntactic units can be isolated and marked for further analysis. In this paper we analyze a method for modeling compound verb forms in Serbo-Croatian and, consequently, an approach to the automatic recognition of the string sequences in a digital text that represent such forms. Processing a text in a highly inflective language has to include thorough lexical preprocessing and lemmatization, which must take into account the various inflected forms of words. The goal of lemmatization is to determine, for each textual word, a lemma and the grammatical information that corresponds to it. For nouns, for example, lemmatization usually assigns to a textual word its nominative form and the corresponding grammatical information (e.g. gender, number, case and other properties). The main objective of the paper is to analyze methods for lemmatizing compound verb forms that occur in a digital text. Because the parts of a compound verb form can be distributed locally within a sentence, lemmatization has to “bring together” the parts of the same verb form and determine the lemma (i.e. the infinitive form of the verb) and the corresponding grammatical information (e.g. gender, person, tense). Unlike the lemmatization of nouns and adjectives, which has been studied exhaustively (cf. [3] and [12]), compound verb lemmatization is covered by few references, even for languages that are not highly inflective.
As a starting point for our research we used the framework presented in [2], which deals with compound verb forms in English. The paper is structured as follows. First, we briefly discuss the problems of compound verb lemmatization and the resources and tools that we used. We then present a case study of the lemmatization of the Preterit Tense in Serbo-Croatian, together with some other examples of verb phrase lemmatization.
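
The idea of "bringing together" the separated parts of a compound verb form can be illustrated with a small sketch. It is not the local-grammar method of the paper: it replaces finite-state local grammars with a plain window search in Python, and it assumes a compound past form built from a clitic present form of the auxiliary "biti" and an active participle. The two lexicons, the function name `find_compound_past`, and the `max_gap` parameter are hypothetical and exist only for illustration; homographs are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical toy lexicon: clitic present forms of the auxiliary "biti",
# mapped to (person, number).
AUXILIARY = {
    "sam": ("1", "sg"), "si": ("2", "sg"), "je": ("3", "sg"),
    "smo": ("1", "pl"), "ste": ("2", "pl"), "su": ("3", "pl"),
}

# Hypothetical toy lexicon: active participle -> (infinitive, gender, number).
# A real system would take this from a morphological dictionary.
PARTICIPLE = {
    "čitao": ("čitati", "m", "sg"),
    "čitala": ("čitati", "f", "sg"),
    "čitali": ("čitati", "m", "pl"),
    "radio": ("raditi", "m", "sg"),
    "radila": ("raditi", "f", "sg"),
    "radili": ("raditi", "m", "pl"),
}


@dataclass(frozen=True)
class CompoundVerb:
    lemma: str             # infinitive of the main verb
    person: str            # taken from the auxiliary
    number: str            # must agree between auxiliary and participle
    gender: str            # taken from the participle
    aux_index: int         # token position of the auxiliary
    participle_index: int  # token position of the participle


def find_compound_past(tokens: Sequence[str], max_gap: int = 3) -> Optional[CompoundVerb]:
    """Return the first auxiliary + participle pair separated by at most
    max_gap tokens, in either order, whose number values agree."""
    words = [t.lower() for t in tokens]
    for i, word in enumerate(words):
        if word not in AUXILIARY:
            continue
        person, number = AUXILIARY[word]
        # Try the nearest candidates first, to the right and then to the left.
        for distance in range(1, max_gap + 2):
            for j in (i + distance, i - distance):
                if 0 <= j < len(words) and words[j] in PARTICIPLE:
                    lemma, gender, part_number = PARTICIPLE[words[j]]
                    if part_number == number:
                        return CompoundVerb(lemma, person, number, gender, i, j)
    return None
```

For the token sequence "Ona je juče čitala knjigu", the sketch pairs the auxiliary "je" with the participle "čitala" across the intervening adverb and assigns the lemma "čitati". The agreement check on number is one simple way to reject accidental pairings; a local grammar would encode such constraints in the transducer itself.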

Similar resources

A cross-linguistic study of the processing of causative sentences.

The comprehension of sentences expressing instigative causation (e.g., The horse makes the camel run) was investigated in children between the ages of 2;0 and 4;4, speaking English, Italian, Serbo-Croatian and Turkish. Crosslinguistic differences in development reveal the roles of morphological (causative particle, case inflection) and syntactic devices (periphrasis, word order) in guiding chil...

Reducing Lexical Ambiguity in Serbo-Croatian

This paper presents an approach to acquisition of some lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can be used to reduce lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computation of the minimal representation of grammatical features of ...

Clitics in Slavic

0. Introduction; 1. Some general thoughts on Slavic clitics; 2. Second position clitics ...

A model of the perception of Serbo-Croatian word tone

Purcell (1979) presented data on the perception of Serbo-Croatian word tone by native speakers. The present paper develops a logistic regression model of the perception of Serbo-Croatian word tone using Purcell’s 1979 data. Two models are developed: an overall model and a two-part, split model. Model fits are calculated and plotted. The two-part model fits the perceptual data better. Model coeff...

Visual Word Recognition in Serbo-Croatian Is Necessarily Phonological

In a naming task conducted with bi-alphabetic readers of Serbo-Croatian, it was shown that letter strings that can be assigned both a Roman and a Cyrillic alphabet reading incur longer latencies than the unique alphabet transcription of the same word, and that the magnitude of the difference depended on the number of ambiguous characters in the ambiguous letter string. While this within-word p...



Publication date: 2002